In this project, you will use R and apply exploratory data analysis techniques to examine relationships in one or more variables and to explore a specific dataset for distributions, outliers, and anomalies.
Exploratory Data Analysis (EDA) is the numerical and visual analysis of the characteristics of data and their relationships, using formal methods and statistical strategies.
EDA can give us insights, which can lead to new questions and eventually to predictive models. It is an important “line of defense” against bad data and an opportunity to check whether your assumptions or intuitions about a dataset are being violated.
This analysis explores a dataset of red wines [Cortez et al., 2009], originally built to model wine quality as reflected in the chemical properties of each wine. The dataset has 1599 records with 11 variables (chemical properties) plus the wine quality (from 0 to 10). A friend with a degree in chemistry helped me identify chemical properties that can give a wine an unpleasant taste, and I will use those hypotheses to guide my analysis.
Let’s first take a high-level look at the data to see what we’re dealing with:
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
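The two printouts above look like what summary() and str() return for a data frame. Here is a minimal sketch of those calls. Since the data file itself isn’t shown, it rebuilds a tiny stand-in data frame from the first ten rows of a few columns in the str() output, so the numbers it prints will differ from the full-dataset figures. The read.csv() path in the comment is purely hypothetical.

```r
# In the full analysis the data would come from a file, for example:
# wine <- read.csv("path/to/red_wine.csv")   # hypothetical path

# Stand-in: first ten rows of a few columns, copied from the str() output
wine <- data.frame(
  X             = 1:10,
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  pH            = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality       = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

summary(wine)  # min, quartiles, median, mean and max for every column
str(wine)      # dimensions, column types and the first few values
```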
Looks like we have 1599 observations containing 11 attributes each (X seems to be a duplicate of the index, so we can probably delete it altogether; quality is the output variable, a grade given to each sample by professional wine judges).
Let’s delete the X variable and run the ncol() function to make sure it worked:
## [1] 12
It did work, so we’re ready to move on.
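For reference, dropping a column and checking the result might look like the sketch below. It uses a three-row stand-in data frame (values taken from the str() output above), so it goes from 3 columns to 2, whereas the full dataset goes from 13 to 12. The variable names n_before and n_after are mine.

```r
# Tiny stand-in data frame with an index-like X column
wine <- data.frame(
  X       = 1:3,
  alcohol = c(9.4, 9.8, 9.8),
  quality = c(5L, 5L, 5L)
)

n_before <- ncol(wine)
wine$X <- NULL           # assigning NULL removes the column
n_after <- ncol(wine)

c(before = n_before, after = n_after)
names(wine)              # X should no longer be listed
```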
Let’s build a histogram of wine quality to see what wine made its way into this dataset:
Most ratings are average (concentrated at 5 and 6), with very few excellent and poor wines. It’d be interesting to see how many samples actually got grades of 4 and under or 8 and over.
Poor wines:
## [1] 63
Excellent wines:
## [1] 18
Only about 5% (81 out of 1599) of all wine samples were given “extreme” grades! Further on throughout the analysis, we’ll label wine samples with grades of 4 and under as poor and with grades of 8 and over as excellent.
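The share of “extreme” grades can be double-checked with a couple of lines, using the two counts printed above (63 and 18) and the 1599 rows reported by str(). The helper label_quality() is a hypothetical name of mine, shown only as one possible way to attach the poor/excellent labels; the “medium” label for everything in between is also my assumption.

```r
n_poor      <- 63     # grades 4 and under, from the output above
n_excellent <- 18     # grades 8 and over, from the output above
n_total     <- 1599   # rows reported by str()

share <- (n_poor + n_excellent) / n_total
round(100 * share, 1)   # percentage of samples with extreme grades

# Hypothetical helper: map a numeric grade to a label
label_quality <- function(q) {
  cut(q,
      breaks = c(-Inf, 4, 7, Inf),            # (-Inf, 4], (4, 7], (7, Inf)
      labels = c("poor", "medium", "excellent"))
}

label_quality(c(3, 4, 5, 7, 8))
```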
Let’s now explore how much alcohol our wine samples contain:
The most common alcohol percentage is around 9.4, with what looks like another peak in the 12-12.4 area, but it might be more insightful to subset our data to see how alcohol percentages are distributed among poor and excellent wines.
At first glance, excellent wines tend to contain more alcohol than poor ones. However, to be more certain, we might want to compute correlation between the quality and alcohol content, which is exactly what we’ll be doing in the Bivariate Plots section of the analysis.
Time to move on to how much residual sugar and salt (that is, chlorides) our wine samples contain. Let’s start with residual sugar:
Apparently, it makes sense to set some limits on the X axis and pick a more granular binwidth.
There’s a spike between 1 and 2 - let’s go even more microscopic and take a closer look at it.
Seems like values are distributed more or less normally in this range, with peaks at 1.2 and 1.4 and a little right tail.
On the whole, we can see a pronounced positive skew in the distribution of residual sugar, so we could try log-transforming this data to get rid of the right tail.
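A rough sketch of the log transform, using base R and only the ten residual sugar values visible in the str() output (so it is an illustration, not the full-dataset result). The skewness() helper is a simple moment-based estimate I define here, not something from the original analysis; hist() is called with plot = FALSE so the bin counts can be inspected without opening a graphics device.

```r
# First ten residual sugar values from the str() output above
sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)
log_sugar <- log10(sugar)

# Simple moment-based skewness (hypothetical helper)
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

skew_raw <- skewness(sugar)
skew_log <- skewness(log_sugar)
c(raw = skew_raw, log10 = skew_log)   # the log10 value should be smaller

# Bin counts before and after the transform (no plot drawn)
h_raw <- hist(sugar, plot = FALSE)
h_log <- hist(log_sugar, plot = FALSE)
h_raw$counts
h_log$counts
```

Dropping plot = FALSE draws the histograms instead of just returning the counts.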
Looks better! The distribution has become bimodal, with the peaks at about 1.3 and 8.
The description that comes with this dataset says that wines with less than 1 gram of residual sugar per liter are quite rare, but the initial histogram we built clearly indicates we have some. Let’s see what those wine samples are:
## [1] 2
Since we’re mostly interested in the quality, it might be a good idea to find out what grades those wines were given.
Both of them were graded 6.
Wines containing more than 45 grams of residual sugar per liter are considered sweet. None of our samples comes close to that: the sweetest one contains 15.5 grams per liter. Let’s take a look at it:
It got an average score of 5 from the judges.
Let’s now explore residual sugar levels in poor and excellent wines:
The modes here are 1 (for poor wines) and 2 (for excellent wines) - doesn’t look like much of a difference. Both distributions have right tails, but the one for poor wines is much longer: it reaches 12.9, versus 6.4 for excellent wines.
Wine tasting notes also contain the word ‘saline’, which refers to how salty a wine is. But can a wine really be salty and how does it affect its quality? Let’s try to find this out by examining the next attribute - chlorides - which indicates the amount of salt in the wine.
Same as with residual sugar, distribution of chlorides is positively skewed, so log-transforming the data might help us solve this issue:
As we can see, there’s a pronounced peak at around 0.08, with the bulk of the data lying between 0.07 and 0.09.
Let’s also take a look at the wine samples with the extreme values:
Both of those were given 5 by the judges, not very impressive.
It’s curious to see what level of chlorides poor and excellent wines actually have:
The entire subset of excellent wines is situated below 0.09, while the poor wine samples are more spread out and their distribution has a long right tail (reaching 0.61). So, tentatively, wines with higher grades tend to have a lower level of chlorides and therefore be less salty.
Moving on to pH, which describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic).
Most of the wine samples fall in the 3.2-3.4 range, with the median at 3.31. It would be beneficial to compare pH distributions for excellent and poor wines side by side.
At first glance, no drastic differences that catch the eye. We’ll need to examine the correlation between the pH level and the wine quality more closely in the Bivariate Plots section to be able to draw any conclusions.
Time to explore density. According to the dataset description, wine density is usually close to that of water and largely depends on alcohol content and residual sugar in a particular wine.
We can see a couple of outliers to the right, but the data is densely packed under the value of 1 - zooming in a bit might offer some more insight:
Most of our samples seem to have a density that is a bit lower than that of water: the median is 0.9968.
Same as with pH, we’re going to examine density of poor and excellent wines separately.
0.993 for poor wines vs 0.991 for excellent wines - to establish whether this difference in peak values is significant, some statistical tests might be necessary, but that’s beyond the scope of this analysis.
In small quantities, citric acid can add ‘freshness’ and flavor to wines. I wonder how much citric acid our wine samples contain.
The data seems normally distributed, with an unexpected spike at around 0.49. Let’s zoom in to see what’s going on there.
There are many wine samples to the right of the mode with the value of 0.49. I wonder what grades those wines got:
Surprisingly, the grades run the gamut from 4 to 9. However, most wine samples that contain 0.49 grams of citric acid per liter scored 6.
Let’s turn to examining concentration of citric acid in poor and excellent wines separately:
The bulk of excellent wines lies between 0.25 and 0.35, whereas in case of poor wines the samples are more spread out, most of them falling in the 0.05-0.5 range. The modes of the two distributions are almost equal: 0.25 for poor wines vs 0.3 for excellent ones.
Next up is sulphates, added to wines as an antimicrobial and antioxidant.
The distribution has a long right tail, so - again - log-transforming the data might help fix this issue:
Now it looks more normally distributed, with a pronounced peak at around 0.53. The initial histogram exposes a few outliers that have more than 1 gram of sulphates per liter - let’s look at those:
Those wines cover almost the whole scale, with grades from 4 to 8.
As usual, we’ll now examine concentration of sulphates in poor and excellent wines:
The bulk of poor wine samples seems to be more tightly packed, whereas excellent wine samples look more spread out, peaking at 0.45 and 0.4, respectively.
In this subsection, we’ll be looking at concentration of tartaric and acetic acids. The latter, at too high levels, can make a wine taste like vinegar.
We might benefit from a more granular histogram here:
Now it’s easier to see the peak value, which is around 6.5.
In the initial histogram, some outliers are immediately obvious. We’ll examine a few of them more closely:
Almost half of those are poor wine samples and the rest are of medium quality (5 and 6). To check if wine quality drops as tartaric acid concentration increases, we might want to compare this concentration in poor and excellent wines, so that we can draw some tentative conclusion:
Both the peak values seem equal to 7, although the distribution of poor wine samples is more right-skewed, whereas that of excellent wine samples is left-skewed.
We’ll analyze acetic acid the same way as we did tartaric acid.
This distribution has a long right tail, so we can proceed in two ways: just chop the tail off by applying the limit to the X axis, or log-transform our data. Let’s try to do both for a change and see what results we end up with.
We get almost the same peak value of about 0.28, although it’s a bit off to the left (by circa 0.01) in case of log10 transform.
Let’s now take a closer look at some of the outliers that contain over 0.9 grams of acetic acid per liter:
We can see those are poor to medium wines, which seems to be in line with the above statement from the dataset description that claims that higher concentrations of this acid lead to a pronounced taste of vinegar in a wine sample. Of course, to draw any conclusions, a more in-depth analysis is needed, which we’ll undertake in the next sections. For now, we’ll try to find out what concentrations of acetic acid are typical of poor and excellent wines.
Peak values are almost equal, although the figure seems to be a bit greater for poor wines (0.28 vs 0.26). Moreover, the distribution of poor wines has a longer right tail that extends beyond higher values than that of excellent wines.
Here, we’ll be exploring SO2 levels in our wine samples, starting off with free SO2.
Almost all the wine samples sit under the value of 100, so we might want to zoom in a bit:
This distribution seems to peak at a value close to 30.
Let’s also examine the outlier situated far off to the right in the first histogram:
Turns out it’s an average wine sample, graded 6.
Time to compare how much free SO2 is contained in poor and excellent wines:
Looks like lower levels of free SO2 are more typical of poor wines, with the peak at 2 and the bulk of the data sitting between 2 and 37. For excellent wines, most wine samples fall in the 27-47 range, with the peak at 29. One more interesting thing: the poor wines distribution here is the only one so far that looks like an exponential one, which means higher levels of free SO2 are much more rarely observed in poor wines than low ones.
Since total SO2 = free SO2 + bound SO2, we can create a new variable for bound SO2 and analyze it separately, same as we did with free SO2.
Let’s see some high-level information about our new variable:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 12.00 21.00 30.59 39.00 251.50
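The new variable can be derived in one line. The sketch below uses the first ten free and total SO2 values from the str() output, so its summary will not match the full-dataset figures printed above; the column name bound.sulfur.dioxide is my assumption, chosen to match the naming of the existing columns.

```r
# First ten rows of the two SO2 columns, copied from the str() output
wine <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)
)

# bound SO2 = total SO2 - free SO2 (column name is an assumption)
wine$bound.sulfur.dioxide <- wine$total.sulfur.dioxide - wine$free.sulfur.dioxide

wine$bound.sulfur.dioxide
summary(wine$bound.sulfur.dioxide)
```

A quick sanity check worth running on the real data is that the new column is never negative, since free SO2 cannot exceed total SO2.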
Now we can build a few histograms to better understand how it behaves.
We can clearly see an outlier containing over 250 mg/l of bound SO2 - let’s go microscopic on it:
It’s actually a fairly good wine, with a score of 7.
Now we’ll apply some breaks and limits to the X axis to be able to see separate values more clearly:
The peak’s become more discernible - it’s about 82.
Moving on to comparing the levels of bound SO2 in poor and excellent wines.
The peaks for the poor wines and excellent wines distributions are about 104 and 76, respectively. The bulk of the poor wine samples falls in a wider range between 44 and 168, whereas it’s just between 56 and 112 for excellent wines. Based only on this quick visual comparison, we can tentatively say that excellent wines tend to contain less bound SO2 than poor wines do.
The last to be analyzed in this subsection is the total level of SO2, which, I assume, must strongly correlate with both the level of free SO2 and the level of bound SO2 since it’s just the sum of the two. Let’s see if the total SO2 histograms we build are very different from what we had for free and bound SO2.
This histogram also shows an outlier with a pretty high total level of SO2. I believe it’s the same wine sample that we looked at above, when dealing with either free or bound SO2, but let’s check it to be sure:
Indeed, it’s the same wine sample that stood out in the bound SO2 histogram.
Same as with free and bound SO2, let’s build a more granular histogram to see more clearly how the values are distributed:
Judging by the histogram, the mode of this distribution is somewhere near 114.
What’s left for us is to explore the total levels of SO2 across poor and excellent wines:
The poor wines histogram peaks at 109 and then at 189, whereas the excellent wines histogram shows two distinct peaks situated fairly close to each other - at 99 and 119. Also, poor wine samples are more spread out across the X axis, and the poor wines distribution seems to have a left tail.
The dataset contains 1599 wine samples with 11 attributes (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol; I’m not counting the X variable here since, as I said above, it’s just a duplicate of the index, so I dropped it from the dataset before going on with the analysis) and a final grade each sample received from the professional wine judges.
The main feature is quality, because this whole analysis is driven by the question “what influences the quality of red wine?”. In the next two sections, I’m going to focus on exploring the relationships between quality and other features and their combinations.
Based on the univariate analysis I’ve performed so far, I have a reason to believe features like alcohol content, chlorides, levels of SO2, fixed and volatile acidity might be more or less reliable indicators of wine quality, but it’s hard to say anything for sure until bivariate and multivariate analyses are carried out and feature relationships are explored in various ways.
Since I had data on both free SO2 and total SO2 in wines, I created a new variable called bound sulfur dioxide (SO2) by subtracting free SO2 from total SO2. I’m yet to analyze this variable closer in the next sections of my analysis, but for now it seems like higher-quality wine samples tend to contain slightly lower levels of bound SO2.
I didn’t have to clean anything or fill any gaps as this dataset is prepared in such a way that there’s no missing data in it.
As I mentioned above, at the beginning of my analysis, I got rid of the X variable since it was just a duplicate of the index and didn’t help me in any way.
All the remaining features in the dataset seem more or less normally distributed, but some of the distributions are positively skewed: residual sugar, chlorides, sulphates, volatile acidity. I log-transformed (log10) all of them to solve the issue of long tails. In fact, the residual sugar distribution turned out to be bimodal, with peaks at 1.3 and 8.
One thing I noticed is that concentration of almost all the substances is given in grams per cubic decimeter (or grams per liter, which is the same thing, and I prefer this notation), with a notable exception of levels of SO2, which are given in milligrams per liter. Further down the road, it might be worthwhile to convert them to grams per liter to see if anything changes. Same story with density, which can later be converted from grams per cubic centimeter to grams per liter to see if that transformation brings anything new and unexpected to the analysis.
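The conversions mentioned above are plain rescalings. A minimal sketch, applied to the mean total SO2 and mean density from the summary() output; the two function names are hypothetical helpers of mine.

```r
# 1 g = 1000 mg, so mg/l -> g/l divides by 1000
mg_per_l_to_g_per_l <- function(x) x / 1000

# 1 l = 1000 cm^3, so g/cm^3 -> g/l multiplies by 1000
g_per_cm3_to_g_per_l <- function(x) x * 1000

so2_g_per_l     <- mg_per_l_to_g_per_l(46.47)    # mean total SO2 from summary()
density_g_per_l <- g_per_cm3_to_g_per_l(0.9967)  # mean density from summary()

c(total_so2 = so2_g_per_l, density = density_g_per_l)
```

Because both conversions only multiply by a constant, the shapes of the histograms and the correlation coefficients stay the same; only the axis scales change.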
In this section, relationships between pairs of features will be examined. One such relationship is correlation, and the quickest way to obtain pairwise correlations for the whole dataset is to use the ggpairs() function from the GGally package.
We can see a more or less pronounced correlation (I set the threshold at an absolute value of 0.35) between the following pairs:
Positive:
Negative:
Our main variable, quality, is correlated the most with alcohol (0.436), density (-0.307), bound sulfur dioxide (-0.218), chlorides (-0.21), and volatile acidity (-0.195).
It would make sense to concentrate our efforts on studying the identified correlated pairs more carefully.
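As a package-free complement to ggpairs(), the pairs above the 0.35 threshold can also be pulled out of a plain correlation matrix. The sketch below uses base R’s cor() on the first ten rows of a few columns from the str() output, so the coefficients it finds are illustrative only and will not match the full-dataset values quoted above. The names threshold and strong are mine.

```r
# With GGally installed, the plot matrix would be: GGally::ggpairs(wine)

# Stand-in: first ten rows of a few columns, copied from the str() output
wine <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

cors <- cor(wine)     # Pearson correlation matrix
threshold <- 0.35

# Keep each pair once (upper triangle) if |r| exceeds the threshold
idx <- which(abs(cors) > threshold & upper.tri(cors), arr.ind = TRUE)
strong <- data.frame(
  var1 = rownames(cors)[idx[, "row"]],
  var2 = colnames(cors)[idx[, "col"]],
  r    = cors[idx]
)

strong[order(-abs(strong$r)), ]   # strongest pairs first
```

Splitting the result by the sign of r gives the positive and negative lists.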
In this subsection, we’ll look at how quality varies with the rest of the features and try to find out if any feature allows definitely telling a good wine from a bad one.
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.700 7.150 7.500 8.360 9.875 11.600
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.600 6.800 7.500 7.779 8.400 12.500
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.000 7.100 7.800 8.167 8.900 15.900
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.700 7.000 7.900 8.347 9.400 14.300
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.900 7.400 8.800 8.872 10.100 15.600
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.000 7.250 8.250 8.567 10.225 12.600
The worst wines in the dataset (grade 3) have the highest minimum fixed acidity (6.7) and the lowest maximum (11.6). The medians are lowest for grades 3 and 4 (7.5 each) and highest for grades 7 and 8 (8.8 and 8.25), while the maximum values show no clear pattern.
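The per-grade printouts in this subsection have the shape that by() produces when summary() is applied to one column within each quality group. A minimal sketch, again on the first ten rows from the str() output (only grades 5, 6 and 7 occur there, so the output is much shorter than the real one); the medians object is my addition for comparing groups at a glance.

```r
# Stand-in: first ten rows of two columns, copied from the str() output
wine <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  quality       = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# One summary() block per quality grade
by(wine$fixed.acidity, wine$quality, summary)

# Just the medians, as a named vector indexed by grade
medians <- tapply(wine$fixed.acidity, wine$quality, median)
medians

# The matching box plot could be drawn with:
# boxplot(fixed.acidity ~ quality, data = wine)
```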
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4400 0.6475 0.8450 0.8845 1.0100 1.5800
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.230 0.530 0.670 0.694 0.870 1.130
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.180 0.460 0.580 0.577 0.670 1.330
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1600 0.3800 0.4900 0.4975 0.6000 1.0400
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3000 0.3700 0.4039 0.4850 0.9150
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2600 0.3350 0.3700 0.4233 0.4725 0.8500
This box plot shows a steady decline in median volatile acidity as quality rises: it starts at 0.845 for grade 3, drops to 0.49 at grade 6, and levels off at 0.37 for grades 7 and 8. One thing to note here is that the worst wine samples have both the highest minimum (0.44) and the highest maximum (1.58) values of volatile acidity, whereas the best ones have the lowest maximum (0.85).
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0050 0.0350 0.1710 0.3275 0.6600
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0300 0.0900 0.1742 0.2700 1.0000
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0900 0.2300 0.2437 0.3600 0.7900
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0900 0.2600 0.2738 0.4300 0.7800
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.3050 0.4000 0.3752 0.4900 0.7600
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0300 0.3025 0.4200 0.3911 0.5300 0.7200
The median level of citric acid rises with the grade, from 0.035 at grade 3 to 0.42 at grade 8, with the biggest jumps between grades 4 and 5 and between grades 6 and 7. The best wine samples are the only group with a non-zero minimum (0.03), and the worst ones have the lowest maximum (0.66).
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.200 1.875 2.100 2.635 3.100 5.700
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.300 1.900 2.100 2.694 2.800 12.900
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.200 1.900 2.200 2.529 2.600 15.500
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.477 2.500 15.400
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.200 2.000 2.300 2.721 2.750 8.900
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.400 1.800 2.100 2.578 2.600 6.400
The median level of residual sugar hardly changes across the wine grades, staying between 2.1 and 2.3. The sweetest wine sample in the dataset (15.5) has a grade of 5. The best wine samples have the highest minimum (1.4) and the second-lowest maximum (6.4, after 5.7 for grade 3) levels of residual sugar.
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0610 0.0790 0.0905 0.1225 0.1430 0.2670
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04500 0.06700 0.08000 0.09068 0.08900 0.61000
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.03900 0.07400 0.08100 0.09274 0.09400 0.61100
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.03400 0.06825 0.07800 0.08496 0.08800 0.41500
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.06200 0.07300 0.07659 0.08700 0.35800
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04400 0.06200 0.07050 0.06844 0.07550 0.08600
Wine samples graded 3 have the highest median level of chlorides (0.0905); after that the median generally goes downward and hits the bottom at grade 8 (0.0705). The best wine samples also have the lowest maximum level of chlorides, which is 0.086: about 3 times lower than the runner-up (0.267, grade 3) and about 7 times lower than the greatest value in the dataset (0.611).
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.0 5.0 6.0 11.0 14.5 34.0
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 6.00 11.00 12.26 15.00 41.00
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 9.00 15.00 16.98 23.00 68.00
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 8.00 14.00 15.71 21.00 72.00
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 6.00 11.00 14.05 18.00 54.00
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 6.00 7.50 13.28 16.50 42.00
Wine samples with the grade of 3 have the lowest median level of free SO2 among all (6) and the lowest maximum level (34); the greatest maximum (72) belongs to wine samples graded 6. Most values lie below the threshold of 50, above which SO2 might become evident in the nose and taste of wine and influence its quality: even the largest third quartile (23, grade 5) is well under it.
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.00 6.75 11.00 13.90 13.75 37.00
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 8.00 14.00 23.98 32.00 107.00
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 14.00 29.00 39.53 58.00 128.00
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 11.00 19.00 25.16 33.00 126.00
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.00 8.50 15.00 20.97 21.50 251.50
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 9.25 11.00 20.17 22.75 76.00
The median bound SO2 level is highest for average wines (29 at grade 5) and falls towards both ends of the scale, down to 11 for grades 3 and 8, so the relationship with quality is not monotonic. The greatest maximum level of bound SO2 (251.5) belongs to a wine sample graded 7, whereas the worst wines have the lowest maximum level, 37.
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 12.5 15.0 24.9 42.5 49.0
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 14.00 26.00 36.25 49.00 119.00
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 26.00 47.00 56.51 84.00 155.00
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 23.00 35.00 40.87 54.00 165.00
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 17.50 27.00 35.02 43.00 289.00
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.00 16.00 21.50 33.44 43.00 88.00
Since total SO2 = free SO2 + bound SO2, we can observe the same patterns as above, when we analyzed levels of free and bound SO2 in wines. For example, the median level of total SO2 is highest at grade 5 (47) and lowest at grade 3 (15), mirroring the medians of free and bound SO2.
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9947 0.9961 0.9976 0.9975 0.9988 1.0008
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9934 0.9957 0.9965 0.9965 0.9974 1.0010
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9926 0.9962 0.9970 0.9971 0.9979 1.0031
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9954 0.9966 0.9966 0.9979 1.0037
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9906 0.9948 0.9958 0.9961 0.9974 1.0032
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9908 0.9942 0.9949 0.9952 0.9972 0.9988
The median density broadly decreases as the quality grows, from 0.9976 at grade 3 to 0.9949 at grade 8, with a small bump at grade 5 (0.9970). This finding is in line with the negative correlation between quality and density that we saw earlier.
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.160 3.312 3.390 3.398 3.495 3.630
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.300 3.370 3.382 3.500 3.900
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.880 3.200 3.300 3.305 3.400 3.740
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.860 3.220 3.320 3.318 3.410 4.010
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.920 3.200 3.280 3.291 3.380 3.780
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.880 3.163 3.230 3.267 3.350 3.720
Most wine samples lie in the range between 3.2 and 3.4 on the pH scale, and the median values fit into roughly the same range, drifting downward as quality rises: they start at 3.39 (grade 3) and end at 3.23 (grade 8), with a small uptick at grade 6 (3.32).
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4000 0.5125 0.5450 0.5700 0.6150 0.8600
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.4900 0.5600 0.5964 0.6000 2.0000
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.370 0.530 0.580 0.621 0.660 1.980
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4000 0.5800 0.6400 0.6753 0.7500 1.9500
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3900 0.6500 0.7400 0.7413 0.8300 1.3600
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.6300 0.6900 0.7400 0.7678 0.8200 1.1000
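The median values sit between 0.545 and 0.74 and rise with quality: the worst wines (grade 3) have the lowest median level of sulphates, while grades 7 and 8 share the highest (0.74). The maxima tell a different story: the largest values (close to 2.0) occur in the middle grades 4-6, whereas the best wines top out at 1.1.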
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.400 9.725 9.925 9.955 10.575 11.000
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 9.60 10.00 10.27 11.00 13.10
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.5 9.4 9.7 9.9 10.2 14.9
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.80 10.50 10.63 11.30 14.00
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.20 10.80 11.50 11.47 12.10 14.00
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.80 11.32 12.15 12.09 12.88 14.00
This box plot reinforces our earlier finding that there’s a strong positive correlation (0.476) between alcohol content and wine quality. It turns out to be especially true for wine samples of higher grades: from grade 5 onwards, the median rises steadily (9.7, 10.5, 11.5, 12.15). For lower-quality wines there is no such trend: the median stays close to 10 for grades 3 and 4 and then dips to 9.7 for grade 5. The best wines have the highest median level of alcohol, which seems to be noticeably different from the medians of lower-quality wines. In this dataset, the median can therefore serve as a rough indicator of wine quality: grade 8 is the only group whose median exceeds 12. Of course, such conclusions are restricted to our dataset only - the situation might be quite different for the whole population of wines.
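To make the shape of this trend concrete, here is a small sketch that takes the median alcohol values printed in the summaries above and looks at the change from one grade to the next. The variable names are my own.

```r
# Median alcohol per grade, copied from the summaries printed above.
alcohol_median <- setNames(
  c(9.925, 10.00, 9.70, 10.50, 11.50, 12.15),
  3:8
)

# Change in the median from each grade to the next (named by the higher grade).
median_step <- diff(alcohol_median)
print(round(median_step, 3))

# Grades whose median alcohol level exceeds 12.
print(names(alcohol_median)[alcohol_median > 12])
```

A negative step marks a grade whose median is lower than that of the grade before it.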
Let’s also build a few color-coded density plots for some of the features that formed the most strongly correlated pairs.
In all these plots, we can clearly see a bimodal distribution for the best wines. I guess this effect is due to there being very few wine samples of the highest grade in the dataset, and these take on just a handful of values. Our analysis might have benefited from a greater number of highest-quality wines, as we could’ve checked whether this pronounced bimodality comes from insufficient data or whether some other factors are at play.
As for the last density plot for residual sugar, the distributions seem quite skewed - and indeed, in the first section of this analysis, we’ve found out that the residual sugar distribution has a very heavy right tail. Let’s now try rebuilding the same density plot, but with the residual sugar variable log-transformed.
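A quick way to see what a log transform does to a right-skewed variable is to compare the gap between mean and median before and after. This is only a sketch on simulated log-normal values standing in for residual sugar (an assumption for illustration, not the real data).

```r
set.seed(42)
# Simulated right-skewed values standing in for residual sugar (not real data).
sugar_sim <- rlnorm(1000, meanlog = 1, sdlog = 0.5)

# A mean well above the median signals a heavy right tail.
raw_gap <- mean(sugar_sim) - median(sugar_sim)

# After a log10 transform the two should nearly coincide.
log_sugar <- log10(sugar_sim)
log_gap <- mean(log_sugar) - median(log_sugar)

print(c(raw = raw_gap, log10 = log_gap))

# Kernel density estimate on the log scale; plot(log_density) would draw it.
log_density <- density(log_sugar)
```

On real data, the density estimate on the log scale is what would reveal any additional structure, such as more than one peak.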
Now it becomes obvious that this distribution is actually bimodal across all wine grades! It’s a pretty curious finding that I can’t explain right away for lack of domain knowledge. It might even be a phenomenon peculiar to Portuguese wines - it’s really hard to tell without having more data handy.
The strongest positive correlation involving quality is quality vs alcohol (0.476). One particularly interesting thing here is that the upward trend (quality increases as alcohol content grows) holds true only from grade 5 upwards; below this point there is no clear trend: the median alcohol level stays close to 10 for grades 3 and 4 and dips to 9.7 for grade 5. Grade 8 is the only group with a median alcohol value above 12, which might help us tell a good wine sample from a poor one.
The most pronounced negative correlation that has to do with our main feature is observed in the pair quality - density (-0.307). The general trend there is a downward one: with each grade, median density decreases a bit, with the notable exception of one group - wine samples of grade 5, which break this trend and actually have the greatest median density of all grades. Exactly the same picture can be seen in quality vs bound SO2 (-0.218): grade 5 wines once again break the generally downward trend.
Another interesting pattern was discovered in the pair quality vs volatile acidity (-0.195): median values there seemed to change in a wave-like fashion from one grade to another, going up and down a few times.
One more curious finding was that the residual sugar distribution, which is highly skewed initially, turns out to be bimodal across all the wine grades, from lowest to highest, once it is log-transformed and color-coded by quality. As I said above, under the relevant plot, I might be lacking some specialist knowledge to draw the right conclusion based on this fact, or it might just be a peculiarity of Portuguese wines, red ones in particular.
Fun fact: positive correlations were dominated by density (3 occurrences out of 6), while negative ones were dominated even more heavily by alcohol (5 occurrences out of 6). Therefore, it’s only natural that these two features produced the most highly correlated pairs (which I discuss in more detail in the subsection below), and density had a part in both of them!
Among other things, total SO2 and bound SO2 turned out to be positively correlated with both density and residual sugar. As for negative correlations, some of the strongest relationships were observed in the following pairs: total SO2 and free SO2 vs alcohol; pH vs fixed acidity; alcohol vs residual sugar and chlorides.
Surprisingly enough, the two most pronounced correlations didn’t involve the main variable, quality, but instead featured density, which seems to be heavily dependent on both residual sugar and alcohol content. In the former case, the correlation is positive and equals 0.839; in the latter case, the features are negatively correlated (-0.78).
In the previous section, we used box plots to see how different variables are distributed across wine grades and scatter plots to discover interesting pairwise relationships between the features. This section allows us to take our analysis one step further by combining the two techniques and examining what relationships the features display (and how these relationships vary) across wine grades.
Let’s first take a look at a couple of scatter plots for the features that exhibited the strongest correlation, faceted by quality.
Looks like no surprises here. Scatter plots demonstrate the same trends across all wine quality grades: upward for density vs residual sugar and downward for density vs alcohol.
I wonder what plots would look like for less correlated features.
For the lowest-quality wines, alcohol doesn’t seem to be correlated with residual sugar at all, with a negative trend becoming more noticeable towards higher wine grades.
A somewhat similar picture here. In the case of the worst and best wines, alcohol and total SO2 are much less correlated (if correlated at all) than in wine samples of other grades, which all display a more prominent downward trend.
This time the weakest correlation between the features takes place with the best wine samples. In all other cases, an upward trend is obvious.
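The visual impression from faceted scatter plots can be backed up with one correlation coefficient per grade. Below is a minimal sketch of that idea using `split()` and `sapply()`; the data frame `wine_sim` is simulated (an assumption for illustration), so the numbers it prints say nothing about the real wines.

```r
set.seed(1)
# Simulated stand-in for the wine data: three grades, two numeric features.
n <- 60
wine_sim <- data.frame(
  quality = rep(c(5, 6, 7), each = n / 3),
  alcohol = rnorm(n, mean = 10.5, sd = 1)
)
# Make density fall as alcohol rises, plus noise (an assumption for the demo).
wine_sim$density <- 1.0 - 0.001 * wine_sim$alcohol + rnorm(n, sd = 0.0005)

# Pearson correlation of the two features within each grade.
cor_by_grade <- sapply(
  split(wine_sim, wine_sim$quality),
  function(d) cor(d$alcohol, d$density)
)
print(round(cor_by_grade, 2))
```

Grades with only a handful of samples will give unstable coefficients, which is worth keeping in mind when the extreme grades look different from the rest.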
We’ll now build a pretty straightforward linear model to see how well it can predict wine quality based on the features we’ve analyzed.
##
## Calls:
## m1: lm(formula = quality ~ alcohol, data = wine)
## m2: lm(formula = quality ~ alcohol + residual.sugar, data = wine)
## m3: lm(formula = quality ~ alcohol + residual.sugar + density, data = wine)
## m4: lm(formula = quality ~ alcohol + residual.sugar + density + volatile.acidity,
## data = wine)
## m5: lm(formula = quality ~ alcohol + residual.sugar + density + volatile.acidity +
## pH, data = wine)
## m6: lm(formula = quality ~ alcohol + residual.sugar + density + volatile.acidity +
## pH + sulphates, data = wine)
## m7: lm(formula = quality ~ alcohol + residual.sugar + density + volatile.acidity +
## pH + sulphates + free.sulfur.dioxide, data = wine)
##
## =========================================================================================================================
## m1 m2 m3 m4 m5 m6 m7
## -------------------------------------------------------------------------------------------------------------------------
## (Intercept) 1.875*** 1.882*** -42.884*** -24.273* -13.811 -0.150 2.280
## (0.175) (0.176) (12.051) (11.433) (11.858) (11.944) (12.107)
## alcohol 0.361*** 0.361*** 0.401*** 0.339*** 0.346*** 0.325*** 0.320***
## (0.017) (0.017) (0.020) (0.019) (0.019) (0.019) (0.020)
## residual.sugar -0.004 -0.026 -0.016 -0.015 -0.007 -0.003
## (0.013) (0.014) (0.013) (0.013) (0.013) (0.013)
## density 44.547*** 27.216* 17.881 3.630 1.209
## (11.990) (11.367) (11.702) (11.812) (11.975)
## volatile.acidity -1.359*** -1.272*** -1.154*** -1.160***
## (0.096) (0.099) (0.100) (0.100)
## pH -0.383** -0.303* -0.290*
## (0.119) (0.119) (0.119)
## sulphates 0.628*** 0.642***
## (0.104) (0.105)
## free.sulfur.dioxide -0.002
## (0.002)
## -------------------------------------------------------------------------------------------------------------------------
## R-squared 0.227 0.227 0.233 0.319 0.324 0.339 0.340
## adj. R-squared 0.226 0.226 0.232 0.318 0.322 0.336 0.337
## sigma 0.710 0.711 0.708 0.667 0.665 0.658 0.658
## F 468.267 234.040 161.879 187.064 152.580 136.047 116.861
## p 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -1721.057 -1721.016 -1714.127 -1618.932 -1613.786 -1595.704 -1594.954
## Deviance 805.870 805.829 798.915 709.235 704.685 688.926 688.280
## AIC 3448.114 3450.031 3438.254 3249.864 3241.573 3207.408 3207.908
## BIC 3464.245 3471.540 3465.139 3282.127 3279.213 3250.425 3256.302
## N 1599 1599 1599 1599 1599 1599 1599
## =========================================================================================================================
The variables in the largest linear model (m7) account for 34% of the variance in the quality of red wine (R-squared of 0.340).
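The table above grows the model one predictor at a time. As a hedged sketch of that workflow in base R, the snippet below fits nested `lm()` models on simulated data and collects R-squared, adjusted R-squared and AIC side by side. The data frame `sim` and its columns are invented for the demo; the table itself was presumably produced with a dedicated formatting package that I don’t reproduce here.

```r
set.seed(7)
# Simulated data: x1 and x2 carry signal, x3 is pure noise.
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- 1 + 0.8 * sim$x1 - 0.5 * sim$x2 + rnorm(n)

# Grow the model one predictor at a time.
m1 <- lm(y ~ x1, data = sim)
m2 <- update(m1, . ~ . + x2)
m3 <- update(m2, . ~ . + x3)
models <- list(m1 = m1, m2 = m2, m3 = m3)

# In-sample R-squared never decreases as predictors are added;
# adjusted R-squared and AIC penalise the extra terms.
comparison <- data.frame(
  r_squared     = sapply(models, function(m) summary(m)$r.squared),
  adj_r_squared = sapply(models, function(m) summary(m)$adj.r.squared),
  aic           = sapply(models, AIC)
)
print(round(comparison, 3))
```

When an added predictor barely moves R-squared and AIC goes up, that is a reasonable signal to stop adding terms.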
The most prominent correlations we’ve discovered were in fact so strong that, when faceted by wine quality, the features displayed the same trends across all wine grades: for density vs residual sugar, the trend was always upward, for density vs alcohol always downward.
For other, less correlated features (alcohol vs residual sugar, alcohol vs total SO2, density vs total SO2), the trend across the wine grades was also the same, with the exception of the best or worst wines, or both, where the features showed little to no correlation.
Since the correlation between density and residual sugar was somewhat stronger than that between density and alcohol (0.839 vs -0.78), I was especially interested to see how residual sugar and alcohol were correlated and expected at least a slightly positive correlation. To my surprise, the correlation turned out to be strongly negative (-0.451, second strongest among the negative correlations discovered); in fact, it was so strong that a downward trend manifested itself across 5 of the 6 wine grades represented in the dataset, the exception being grade 3, where the features showed no correlation at all.
Yes, I did create a linear model that makes a prediction based on 7 features from the dataset. Further increasing the number of features didn’t yield any significant improvement, so I stopped at this value. Surprisingly enough, the model explains a mere 34% of the variance in the target variable, which is quality. It seems that wine quality is not well explained by physico-chemical properties alone. Two things to note here: first, the quality of prediction could be improved with more data (right now, there are only 1,599 samples); second, there are some other factors at play, so the model might have benefited from the addition of such variables as the price of the wine, the region where it was produced, the year it was produced and other things not related to wine chemistry. Trying out other models may also lead to better results. Say, I have a hunch that tree-based methods would do well in this case.
This box plot supports our finding that the strongest positive correlation our main variable of interest is involved in is quality vs alcohol (0.476). An interesting thing here is that for lower wine grades there is no upward trend: the median alcohol content stays close to 10 for grades 3 and 4 and dips to 9.7 for grade 5. Only from grade 5 onwards does the median rise steadily with each grade, reaching 12.15 for grade 8.
Moreover, the median (and mean as well) alcohol content of the best wines looks markedly different from that of the worst wines (medians of 12.15 for grade 8 vs 9.925 for grade 3), which can be used to more or less reliably tell a quality wine from a poor one.
When plotted unmodified, the residual sugar distribution is highly skewed and has a long right tail. However, when log-transformed, the distribution becomes bimodal. When I later color-coded the plot, I saw the distribution was in fact bimodal across all the wine grades. Intrigued by this phenomenon, I read a few specialized articles on residual sugar in wines, but couldn’t find any explanation that would satisfy me. Therefore I’m inclined to think, for the lack of proof to the contrary, that it’s just a regional thing specific to Portuguese wines.
This faceted scatter plot illustrates the third strongest negative correlation discovered during the analysis - alcohol vs total SO2. Each subplot contains a line of best fit that visually reinforces the trend across wine grades. One interesting observation here is that with the best and worst wines, the features display little to no correlation, whereas for wines of grades 4 through 7, a clearly downward trend manifests itself. It might be an indication that this particular combination of features is a bad candidate for predicting wine quality. Indeed, when I was building a linear model, alcohol turned out to be the best contributor to the overall quality of prediction, whereas total SO2 added absolutely nothing to improve it and therefore was not included in the resulting model.
The dataset I’ve analyzed contains information on 1,599 red wines across 11 variables plus the output variable based on sensory data, that is, a grade on a scale of 0 to 10 given to each wine sample by professional wine judges. This dataset is restricted to Portuguese wines and contains only their physico-chemical properties.
I began my analysis by building histograms of each feature to understand their distribution. They turned out to be normally distributed, with a few notable exceptions (take residual sugar as an example), where I observed heavy skew and long tails. Log-transforming these variables helped me deal with this abnormality. I also defined thresholds for poor (grade 4 and under) and excellent (grade 8 and over) wines, then subset my dataset using these thresholds and plotted distributions of individual features across poor and excellent wines side by side. This helped me see whether these distributions were very different and identify a few potential candidates that could be useful in telling a low-quality wine from a better one.
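The thresholding step described above can be sketched in a few lines of base R. The data frame `wine_toy` and its rows are made up for illustration; only the two cut-offs (grade 4 and under, grade 8 and over) come from the text.

```r
# Toy stand-in for the wine data frame (made-up rows).
wine_toy <- data.frame(
  quality = c(3, 4, 5, 6, 7, 8, 8),
  alcohol = c(9.9, 10.0, 9.7, 10.5, 11.5, 12.1, 12.4)
)

poor_max      <- 4  # "poor": grade 4 and under
excellent_min <- 8  # "excellent": grade 8 and over

poor      <- subset(wine_toy, quality <= poor_max)
excellent <- subset(wine_toy, quality >= excellent_min)

# Compare one feature across the two groups.
print(summary(poor$alcohol))
print(summary(excellent$alcohol))
```

Keeping the thresholds in named variables makes it easy to try other cut-offs without touching the rest of the code.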
I went on to explore pairwise relationships between the features and pick out the most strongly correlated (both positively and negatively) pairs to focus my analysis on them. To my surprise, the main variable of interest - quality - wasn’t involved in any of the strongest correlations identified. I built a few scatter plots and included a line of best fit for each of them to more clearly see the general trend in the data points. Then I added a few box plots that reinforced my earlier findings and offered some new insights.
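Picking out the most strongly correlated pair can be done programmatically from a correlation matrix. The sketch below shows one way to do it with `cor()` on simulated features (`x`, `y`, `z` are hypothetical names, with `y` built to track `x`).

```r
set.seed(3)
# Simulated features: y is built to track x closely, z is independent noise.
n <- 100
feat <- data.frame(x = rnorm(n))
feat$y <- feat$x + rnorm(n, sd = 0.3)
feat$z <- rnorm(n)

cor_mat <- cor(feat)

# Keep each pair once by blanking the diagonal and the lower triangle.
cor_mat[lower.tri(cor_mat, diag = TRUE)] <- NA

# Locate the pair with the largest absolute correlation.
idx <- which(abs(cor_mat) == max(abs(cor_mat), na.rm = TRUE), arr.ind = TRUE)
strongest <- c(rownames(cor_mat)[idx[1, "row"]],
               colnames(cor_mat)[idx[1, "col"]])
print(strongest)
print(round(cor_mat[idx[1, "row"], idx[1, "col"]], 2))
```

Working with absolute values means strong negative pairs are found just as readily as strong positive ones.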
My greatest success was finding out that alcohol content was the most influential feature that could more or less reliably be used to differentiate between poor and excellent wines. Indeed, when I later built a linear model to predict a wine grade, this feature alone reached an R-squared of 0.227, about two thirds of the 0.340 achieved by the full seven-feature model.
In the final part of my analysis, I used wine grades to color-code and facet a few plots that I’d built previously to see if any variables reinforce each other across any of the wine grades. The main finding here was that in the two most strongly correlated pairs the correlation was so pronounced that the trend stayed the same across all wine grades: it was always upward for density vs residual sugar and downward for alcohol vs density. The situation was a bit different for more weakly correlated pairs: the trend did stay the same across most wine grades, but with the worst or best wines, the features I was analyzing displayed little to no correlation at all (for example, alcohol vs total SO2), which signaled that these combinations were probably not the best predictors of wine quality. I tested these findings when building a linear model and excluded the worst contributors from the final version.
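One way to put a number on alcohol’s contribution is to compare the R-squared values printed in the model table: m1 (alcohol only) against m7 (all seven predictors). This is a rough ratio rather than a formal variance decomposition, since the predictors are correlated with one another.

```r
# R-squared values copied from the model table above.
r2_alcohol_only <- 0.227  # m1: quality ~ alcohol
r2_full         <- 0.340  # m7: all seven predictors

# Share of the full model's R-squared already reached by alcohol alone.
share <- r2_alcohol_only / r2_full
print(round(share, 2))

# With a single predictor, R-squared equals the squared Pearson correlation,
# so its square root gives the magnitude of the quality-alcohol correlation.
implied_r <- sqrt(r2_alcohol_only)
print(round(implied_r, 3))
```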
I’ve also bumped into a couple of obstacles along the way. First, I found out that the residual sugar distribution, when log-transformed, is bimodal across all wine grades. I’ve been struggling to explain this phenomenon for some time and even read a few specialized articles on the topic, but found no satisfactory explanation so far. So I’m inclined to believe this phenomenon is specific to Portuguese wines, since that’s what I’ve been analyzing all along.
Another thing I had difficulties with was the linear model that I’d built. It was able to explain only 34% of the variance in wine quality, which I found to be a pretty poor result. At first, I thought I was doing something wrong and actually spent a couple of days trying to engineer new features and combine them in various ways (to no avail), but then I realized that some other factors were at play and physico-chemical properties alone were not enough of a quality predictor.
And this realization leads me to suggestions on how to improve this analysis. First and foremost, more data would be nice. 1,599 wine samples is alright, but given the number of wines in the world, it’s just a drop in the ocean. Besides, the dataset is restricted to only Portuguese wines, which significantly limits its value and ability to represent the whole population. Second, as I mentioned above, there must be some other features that heavily influence wine quality. Better results might have been obtained if we had information about the region where a wine was produced, the year it was produced, grape type, selling price and wine brand, to name a few. Also, it might be a good idea to test other kinds of models and see how they fare against each other. I guess more powerful models, like SVM or tree-based methods, could demonstrate better results.